98 research outputs found
Selection and Estimation for Mixed Graphical Models
We consider the problem of estimating the parameters in a pairwise graphical
model in which the distribution of each node, conditioned on the others, may
have a different parametric form. In particular, we assume that each node's
conditional distribution is in the exponential family. We identify restrictions
on the parameter space required for the existence of a well-defined joint
density, and establish the consistency of the neighbourhood selection approach
for graph reconstruction in high dimensions when the true underlying graph is
sparse. Motivated by our theoretical results, we investigate the selection of
edges between nodes whose conditional distributions take different parametric
forms, and show that efficiency can be gained if edge estimates obtained from
the regressions of particular nodes are used to reconstruct the graph. These
results are illustrated with examples of Gaussian, Bernoulli, Poisson and
exponential distributions. Our theoretical findings are corroborated by
evidence from simulation studies
Video Captioning with Guidance of Multimodal Latent Topics
The topic diversity of open-domain videos leads to various vocabularies and
linguistic expressions in describing video contents, and therefore, makes the
video captioning task even more challenging. In this paper, we propose an
unified caption framework, M&M TGM, which mines multimodal topics in
unsupervised fashion from data and guides the caption decoder with these
topics. Compared to pre-defined topics, the mined multimodal topics are more
semantically and visually coherent and can reflect the topic distribution of
videos better. We formulate the topic-aware caption generation as a multi-task
learning problem, in which we add a parallel task, topic prediction, in
addition to the caption task. For the topic prediction task, we use the mined
topics as the teacher to train a student topic prediction model, which learns
to predict the latent topics from multimodal contents of videos. The topic
prediction provides intermediate supervision to the learning process. As for
the caption task, we propose a novel topic-aware decoder to generate more
accurate and detailed video descriptions with the guidance from latent topics.
The entire learning procedure is end-to-end and it optimizes both tasks
simultaneously. The results from extensive experiments conducted on the MSR-VTT
and Youtube2Text datasets demonstrate the effectiveness of our proposed model.
M&M TGM not only outperforms prior state-of-the-art methods on multiple
evaluation metrics and on both benchmark datasets, but also achieves better
generalization ability.Comment: ACM Multimedia 201
Distribution-Free Tests of Independence in High Dimensions
We consider the testing of mutual independence among all entries in a
-dimensional random vector based on independent observations. We study
two families of distribution-free test statistics, which include Kendall's tau
and Spearman's rho as important examples. We show that under the null
hypothesis the test statistics of these two families converge weakly to Gumbel
distributions, and propose tests that control the type I error in the
high-dimensional setting where . We further show that the two tests are
rate-optimal in terms of power against sparse alternatives, and outperform
competitors in simulations, especially when is large.Comment: to appear in Biometrik
Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data
Bilingual lexicon induction, translating words from the source language to
the target language, is a long-standing natural language processing task.
Recent endeavors prove that it is promising to employ images as pivot to learn
the lexicon induction without reliance on parallel corpora. However, these
vision-based approaches simply associate words with entire images, which are
constrained to translate concrete words and require object-centered images. We
humans can understand words better when they are within a sentence with
context. Therefore, in this paper, we propose to utilize images and their
associated captions to address the limitations of previous approaches. We
propose a multi-lingual caption model trained with different mono-lingual
multimodal data to map words in different languages into joint spaces. Two
types of word representation are induced from the multi-lingual caption model:
linguistic features and localized visual features. The linguistic feature is
learned from the sentence contexts with visual semantic constraints, which is
beneficial to learn translation for words that are less visual-relevant. The
localized visual feature is attended to the region in the image that correlates
to the word, so that it alleviates the image restriction for salient visual
representation. The two types of features are complementary for word
translation. Experimental results on multiple language pairs demonstrate the
effectiveness of our proposed method, which substantially outperforms previous
vision-based approaches without using any parallel sentences or supervision of
seed word pairs.Comment: Accepted by AAAI 201
- …